Update filter documentation for expressions #49309
base: master
Conversation
Signed-off-by: Srinath Krishnamachari <[email protected]>
python/ray/data/dataset.py
Outdated
If you can represent your filter as an expression that leverages Arrow Dataset Expressions, we will do highly optimized filtering using native Arrow interfaces.
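To make the Arrow point concrete, here's a rough sketch of what happens per block on the expression path (illustrative only, not Ray's actual internals):

```python
import pyarrow as pa
import pyarrow.compute as pc

# One block of the dataset as an Arrow table.
table = pa.table({"id": list(range(10))})

# "id <= 4" is evaluated as a vectorized Arrow compute kernel over the
# whole block; no per-row Python function is invoked.
mask = pc.less_equal(table["id"], 4)
filtered = table.filter(mask)
```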
Let's remove the two tips below?
If you can represent your predicate with NumPy or pandas operations, :meth:`Dataset.map_batches` might be faster. You can implement filter by dropping rows.

If you're reading Parquet files with :meth:`ray.data.read_parquet`, and the filter is a simple predicate, you might be able to speed it up by using filter pushdown; see :ref:`Parquet row pruning <parquet_row_pruning>` for details.
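For reference while we discuss, rough sketches of the two alternatives the tips describe (the Parquet path and column name follow the Ray docs example and are placeholders here):

```python
import pyarrow.dataset as pds
import ray

# Tip 1: implement filter by dropping rows in a vectorized batch UDF.
ds = ray.data.range(100)
evens = ds.map_batches(
    lambda batch: batch[batch["id"] % 2 == 0],
    batch_format="pandas",
)

# Tip 2: Parquet filter pushdown; row groups that can't match the
# predicate are skipped at read time.
pruned = ray.data.read_parquet(
    "example://iris.parquet",
    filter=pds.field("sepal.length") > 5.0,
)
```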
And for "Parquet row pruning", let's remove the corresponding section from the performance tips user guide?
That seemed useful to me when I first saw the tip, and it still looks valid. Let me remove it and upload the change here; that will help the discussion of whether it's still valid.
python/ray/data/dataset.py
Outdated
>>> ds.filter(lambda row: row["id"] % 2 == 0).take_all()
[{'id': 0}, {'id': 2}, {'id': 4}, ...]
Maybe remove this since we don't want people using the fn parameter?
Given that expr is limited, I thought we could retain it. Let me remove this one.
Co-authored-by: Balaji Veeramani <[email protected]> Signed-off-by: srinathk10 <[email protected]>
Signed-off-by: Srinath Krishnamachari <[email protected]>
Signed-off-by: srinathk10 <[email protected]>
>>> ds.filter(lambda row: row["id"] % 2 == 0).take_all()
[{'id': 0}, {'id': 2}, {'id': 4}, ...]
>>> ds.filter(expr="id <= 4").take_all()
[{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}]

Time complexity: O(dataset size / parallelism)
I think it's worth showing both the UDF-based and the expr-based versions, and clearly calling out that the expr-based one has clear performance advantages (skipping deserialization, etc.).
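Something like this, for example (a sketch assuming the expr parameter added in this PR):

```python
import ray

ds = ray.data.range(100)

# UDF-based: every row is deserialized into a Python dict and the lambda
# runs once per row.
ds.filter(lambda row: row["id"] <= 4).take_all()

# Expression-based: compiled to an Arrow expression and evaluated with
# native vectorized kernels per block, skipping per-row deserialization.
ds.filter(expr="id <= 4").take_all()
```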
Why are these changes needed?
Update filter documentation for expressions
Related issue number
Checks

- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.